This analysis assesses the extent to which coding sequences are depleted in ATGs. It extends to a more general, metagene-style analysis of codon frequencies by position and frame. It uses the best-transcript annotation and the corresponding start codon position and sequence map made by Corinne Maufrais in June 2018.
This is smoothed to make it readable.
GAA, GAT, GAG. Why?
But not GAC. Why both E codons but only one D codon?
Mostly serine. Is that the N-end rule for degradation at work or a translation initiation phenomenon?
## # A tibble: 421,302 x 5
##      Pos Codon     n    Freq Frame
##    <int> <fct> <int>   <dbl> <fct>
##  1     1 ATG    6634 1       0
##  2     2 AAA      49 0.00739 0
##  3     2 AAC      86 0.0130  0
##  4     2 AAG      87 0.0131  0
##  5     2 AAT      28 0.00422 0
##  6     2 ACA     130 0.0196  0
##  7     2 ACC     146 0.0220  0
##  8     2 ACG      51 0.00769 0
##  9     2 ACT     127 0.0191  0
## 10     2 AGA      51 0.00769 0
## # ... with 421,292 more rows
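A table of this shape could be built roughly as follows. This is a minimal base-R sketch, not the code that produced the output above: the toy sequences, the helper `codons_in_frame`, and the convention that frame 0 starts at the A of the start codon are all assumptions made for illustration.

```r
# Hypothetical toy coding sequences; the real analysis uses the annotated CDS set.
cds <- c("ATGGCTTCAGAA", "ATGTCTGATAAA", "ATGGCTGAAAGC")

# Split one sequence into codons for a given frame offset (0, 1 or 2).
# Assumes the sequence is long enough to hold at least one codon in that frame.
codons_in_frame <- function(s, frame) {
  starts <- seq(1 + frame, nchar(s) - 2, by = 3)
  data.frame(Pos = seq_along(starts),
             Codon = substring(s, starts, starts + 2),
             Frame = frame,
             stringsAsFactors = FALSE)
}

# Stack codons from every sequence in all three frames.
all_codons <- do.call(rbind, lapply(0:2, function(f) {
  do.call(rbind, lapply(cds, codons_in_frame, frame = f))
}))

# Count each codon at each position and frame.
codon_freq <- aggregate(n ~ Pos + Codon + Frame,
                        data = transform(all_codons, n = 1L), FUN = sum)

# Frequency = count divided by the total count at that position and frame.
totals <- ave(codon_freq$n, codon_freq$Pos, codon_freq$Frame, FUN = sum)
codon_freq$Freq <- codon_freq$n / totals

print(codon_freq[order(codon_freq$Frame, codon_freq$Pos), ])
```

With this convention, every sequence contributes ATG at position 1 in frame 0, so that row has a frequency of 1, as in the output above.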
Also not smoothed.
ATG, and indeed all NTG codons, are strongly depleted in frame 2. This is consistent with depletion of TGN codons in frame 0: TGA is a stop codon, so it is avoided to prevent premature termination, while TGT/TGC and TGG encode the rare amino acids cysteine and tryptophan.
ATG and some other ATN codons are depleted in frame 1. Most NNG codons are depleted near the start in frame 1, as are many GNN codons in frame 0 (GAA, GAG, GAT, GGA, GGG, GGT). In fact, G-rich codons seem to be depleted near the start relative to C-rich codons.
There is so much data here that it is difficult to judge.
## # A tibble: 713,522 x 6
##      Pos Codon     n    Freq Frame ascore
##    <int> <fct> <int>   <dbl> <fct> <chr>
##  1     1 ATG    4708 1       0     hi
##  2     2 AAA      27 0.00573 0     hi
##  3     2 AAC      63 0.0134  0     hi
##  4     2 AAG      58 0.0123  0     hi
##  5     2 AAT      17 0.00361 0     hi
##  6     2 ACA      93 0.0198  0     hi
##  7     2 ACC     107 0.0227  0     hi
##  8     2 ACG      33 0.00701 0     hi
##  9     2 ACT      85 0.0181  0     hi
## 10     2 AGA      36 0.00765 0     hi
## # ... with 713,512 more rows
Malabat et al. compare the change in in-frame versus out-of-frame depletion by
For each codon:
EW implemented this, using fixed-width 10-codon windows instead of the smoothed windows from Malabat et al.
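The fixed-width windowing step could look something like the sketch below. It only shows how per-position counts might be pooled into 10-codon windows per frame. It does not reproduce the comparison statistic from Malabat et al., and the `counts` table, its made-up values, and the choice to start window 1 at position 1 are assumptions for illustration.

```r
# Hypothetical per-position counts of a single codon, with totals per position.
# The values are invented so the example runs; they are not real data.
counts <- expand.grid(Pos = 1:40, Frame = 0:2)
counts$n <- as.integer(counts$Pos %% 7 + counts$Frame)
counts$total <- 500L

window_width <- 10L

# Window 1 covers codons 1-10, window 2 covers codons 11-20, and so on.
counts$Window <- (counts$Pos - 1L) %/% window_width + 1L

# Pool counts and totals within each window and frame, then take the frequency.
windowed <- aggregate(cbind(n, total) ~ Window + Frame, data = counts, FUN = sum)
windowed$Freq <- windowed$n / windowed$total

print(windowed)
```

Pooling counts before dividing weights each position by its total, which is probably preferable to averaging per-position frequencies when coverage drops off with distance from the start.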
We need to count ATGs with good vs bad context. Maybe just checking for an A at -3 would give an initial picture? Plot levels of TNNATG, CNNATG, ANNATG, GNNATG?
## # A tibble: 1,853 x 3
## # Groups:   Pos [585]
##      Pos sixmer     n
##    <int> <chr>  <int>
##  1     1 ATGAAG     1
##  2     1 ATGAGC     1
##  3     1 ATGCCC     1
##  4     1 ATGCCT     1
##  5     1 ATGGCA     1
##  6     1 ATGGTT     1
##  7     2 AACTCC     1
##  8     2 ACCTAC     1
##  9     2 CAGTAT     1
## 10     2 CCAGCT     1
## # ... with 1,843 more rows
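One way to get the initial picture by -3 context (TNNATG, CNNATG, ANNATG, GNNATG) is to tabulate the nucleotide three positions upstream of each ATG. The sketch below is illustrative only: the toy sequences and the helper `minus3_context` are hypothetical, it counts every ATG regardless of frame or position, and it is not the code that produced the sixmer table above.

```r
# Hypothetical toy transcript sequences; not real data.
seqs <- c("CCAAAATGGCATGA", "TTTATTATGTCAATGG")

# Return the nucleotide at -3 for every ATG in a sequence,
# counting the A of ATG as +1 (so -3 is three nucleotides upstream of it).
minus3_context <- function(s) {
  hits <- gregexpr("ATG", s, fixed = TRUE)[[1]]
  # Keep ATGs with three upstream nucleotides; this also drops -1 (no match).
  hits <- hits[hits > 3]
  substring(s, hits - 3, hits - 3)
}

context <- unlist(lapply(seqs, minus3_context))

# Fix the factor levels so absent contexts appear with a count of zero.
context_counts <- table(factor(context, levels = c("A", "C", "G", "T")))
print(context_counts)
```

Splitting these counts further by position and frame would let the ANNATG class be compared against the others in the same windows as the depletion analysis.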